Abstract

Concerning different approaches to auto- 
matic PoS tagging: EngCG-2, a constraint- 
based morphological tagger, is compared in 
a double-blind test with a state-of-the-art 
statistical tagger on a common disambigua- 
tion task using a common tag set. The ex- 
periments show that for the same amount 
of remaining ambiguity, the error rate of 
the statistical tagger is one order of mag- 
nitude greater than that of the rule-based 
one. The two related issues of priming 
effects compromising the results and dis- 
agreement between human annotators are 
also addressed. 

1 Introduction¹

There are currently two main methods for auto- 
matic part-of-speech tagging. The prevailing one 
uses essentially statistical language models automat- 
ically derived from usually hand-annotated corpora. 
These corpus-based models can be represented e.g. 
as collocational matrices (Garside et al. (eds.) 1987; 
Church 1988), Hidden Markov models (cf. Cutting 
et al. 1992), local rules (e.g. Hindle 1989) and neu- 
ral networks (e.g. Schmid 1994). Taggers using these 
statistical language models are generally reported to 
assign the correct and unique tag to 95-97% of words 
in running text, using tag sets ranging from some 
dozens to about 130 tags. 

The less popular approach is based on hand-coded 
linguistic rules. Pioneering work was done in the 
1960's (e.g. Greene and Rubin 1971). Recently, new 
interest in the linguistic approach has been shown 
e.g. in the work of (Karlsson 1990; Voutilainen et
al. 1992; Oflazer and Kuruoz 1994; Chanod and 
Tapanainen 1995; Karlsson et al. (eds.) 1995; Vouti- 
lainen 1995). The first serious linguistic competitor 

1 Published in Proceedings of 35th Annual Meeting of 
the Association for Computational Linguistics and 8th 
Conference of the European Chapter of the Association 
for Computational Linguistics. ACL, Madrid. 
to data-driven statistical taggers is the English Constraint
Grammar parser, EngCG (cf. Voutilainen et
al. 1992; Karlsson et al. (eds.) 1995). The tagger 
consists of the following sequentially applied mod- 
ules: 

1. Tokenisation 

2. Morphological analysis 

(a) Lexical component 

(b) Rule-based guesser for unknown words 

3. Resolution of morphological ambiguities 

The tagger uses a two-level morphological anal- 
yser with a large lexicon and a morphological 
description that introduces about 180 different 
ambiguity-forming morphological analyses, as a result
of which each word gets 1.7-2.2 different analyses
on average. Morphological analyses are assigned
to unknown words with an accurate rule-based
'guesser'. The morphological disambiguator
uses constraint rules that discard illegitimate mor- 
phological analyses on the basis of local or global 
context conditions. The rules can be grouped as 
ordered subgrammars: e.g. heuristic subgrammar 2 
can be applied for resolving ambiguities left pending 
by the more 'careful' subgrammar 1. 

Older versions of EngCG (using about 1,150 con- 
straints) are reported (Voutilainen et al. 1992; Vouti- 
lainen and Heikkila, 1994; Tapanainen and Vouti- 
lainen 1994; Voutilainen 1995) to assign a correct 
analysis to about 99.7% of all words while each word 
in the output retains 1.04-1.09 alternative analyses 
on average, i.e. some of the ambiguities remain
unresolved. 

These results have been seriously questioned. One 
doubt concerns the notion "correct analysis". For
example Church (1992) argues that linguists who 
manually perform the tagging task using the double- 
blind method disagree about the correct analysis in 
at least 3% of all words even after they have nego- 
tiated about the initial disagreements. If this were 
the case, reporting accuracies above this 97% 'upper 
bound' would make no sense. 

However, Voutilainen and Jarvinen (1995) empirically
show that an interjudge agreement of virtually 100%
is possible, at least with the EngCG tag set
if not with the original Brown Corpus tag set. This 
consistent applicability of the EngCG tag set is ex- 
plained by characterising it as grammatically rather 
than semantically motivated. 

Another main reservation about the EngCG fig- 
ures is the suspicion that, perhaps partly due to the 
somewhat underspecific nature of the EngCG tag 
set, it must be so easy to disambiguate that also a 
statistical tagger using the EngCG tags would reach 
at least as good results. This argument will be ex- 
amined in this paper. It will be empirically shown 
(i) that the EngCG tag set is about as difficult for a 
probabilistic tagger as more generally used tag sets 
and (ii) that the EngCG disambiguator has a clearly 
smaller error rate than the probabilistic tagger when 
a similar (small) amount of ambiguity is permitted 
in the output. 

A state-of-the-art statistical tagger is trained on 
a corpus of over 350,000 words hand-annotated with 
EngCG tags, then both taggers (a new version
known as EngCG-2² with 3,600 constraints as five
subgrammars³, and a statistical tagger) are applied
to the same held-out benchmark corpus of 55,000 
words, and their performances are compared. The 
results disconfirm the suspected 'easiness' of the 
EngCG tag set: the statistical tagger's performance 
figures are no better than is the case with better 
known tag sets. 

Two caveats are in order. What we are not ad- 
dressing in this paper is the work load required for 
making a rule-based or a data-driven tagger. The 
rules in EngCG certainly took a considerable effort 
to write, and though at the present state of knowl- 
edge rules could be written and tested with less ef- 
fort, it may well be the case that a tagger with an 
accuracy of 95-97% can be produced with less effort 
by using data-driven techniques.⁴

Another caveat is that EngCG alone does not re- 
solve all ambiguities, so it cannot be compared to a 
typical statistical tagger if full disambiguation is re- 
quired. However, Voutilainen (1995) has shown that 
EngCG combined with a syntactic parser produces 
morphologically unambiguous output with an accu- 
racy of 99.3%, a figure clearly better than that of the 
statistical tagger in the experiments below (however, 
the test data was not the same).

Before examining the statistical tagger, two prac- 
tical points are addressed: the annotation of the cor- 
pora used, and the modification of the EngCG tag 
set for use in a statistical tagger. 

2 An online version of EngCG-2 can be found at 
http://www.ling.helsinki.fi/~avoutila/engcg-2.html. 

3 The first three subgrammars are generally highly re- 
liable and almost all of the total grammar development 
time was spent on them; the last two contain rather 
rough heuristic constraints. 

4 However, for an interesting experiment suggesting 
otherwise, see (Chanod and Tapanainen 1995). 



2 Preparation of Corpus Resources 

2.1 Annotation of training corpus 

The stochastic tagger was trained on a sample of 
357,000 words from the Brown University Corpus 
of Present-Day English (Francis and Kucera 1982) 
that was annotated using the EngCG tags. The cor- 
pus was first analysed with the EngCG lexical anal- 
yser, and then it was fully disambiguated and, when 
necessary, corrected by a human expert. This an- 
notation took place a few years ago. Since then, it 
has been used in the development of new EngCG 
constraints (the present version, EngCG-2, contains 
about 3,600 constraints): new constraints were ap- 
plied to the training corpus, and whenever a reading 
marked as correct was discarded, either the analysis 
in the corpus, or the constraint itself, was corrected. 
In this way, the tagging quality of the corpus was 
continuously improved. 

2.2 Annotation of benchmark corpus 

Our comparisons use a held-out benchmark corpus 
of about 55,000 words of journalistic, scientific and 
manual texts, i.e., no training effects are expected 
for either system. The benchmark corpus was an- 
notated by first applying the preprocessor and mor- 
phological analyser, but not the morphological dis- 
ambiguator, to the text. This morphologically am- 
biguous text was then independently and fully dis- 
ambiguated by two experts whose task was also to 
detect any errors potentially produced by the pre- 
viously applied components. They worked indepen- 
dently, consulting written documentation of the tag 
set when necessary. Then these manually disam- 
biguated versions were automatically compared with 
each other. At this stage, about 99.3% of all anal- 
yses were identical. When the differences were col- 
lectively examined, virtually all were agreed to be 
due to clerical mistakes. Only in the analysis of 21 
words, different (meaning-level) interpretations per- 
sisted, and even here both judges agreed the ambigu- 
ity to be genuine. One of these two corpus versions 
was modified to represent the consensus, and this 
'consensus corpus' was used as a benchmark in the 
evaluations. 

As explained in Voutilainen and Jarvinen (1995), 
this high agreement rate is due to two main factors. 
Firstly, distinctions based on some kind of vague
semantics are avoided, which is not always the case with
better known tag sets. Secondly, the adopted analy- 
sis of most of the constructions where humans tend 
to be uncertain is documented as a collection of tag 
application principles in the form of a grammar- 
ian's manual (for further details, cf. Voutilainen and 
Jarvinen 1995). 

The corpus-annotation procedure allows us to perform
a text-book statistical hypothesis test. Let the null
hypothesis be that any two human evaluators will
necessarily disagree in at least 3% of the cases.
Under this assumption, the probability of an observed
disagreement of less than 2.88% is less than 5%. This
can be seen as follows: For the relative frequency of
disagreement, $f_n$, we have that $f_n$ is approximately
$\sim N\left(p, \sqrt{p(1-p)/n}\right)$, where $p$ is the
actual disagreement probability and $n$ is the number
of trials, i.e., the corpus size. This means that

$$P\left(\frac{f_n - p}{\sqrt{p(1-p)}} \sqrt{n} < x\right) \approx \Phi(x)$$

where $\Phi$ is the standard normal distribution
function. This in turn means that

$$P\left(f_n < p + x \sqrt{\frac{p(1-p)}{n}}\right) \approx \Phi(x)$$

Here $n$ is 55,000 and $\Phi(-1.645) = 0.05$. Under the
null hypothesis, $p$ is at least 3% and thus:

$$P\left(f_n < 0.03 - 1.645 \sqrt{\frac{0.03 \cdot 0.97}{55{,}000}}\right) = P(f_n < 0.0288) < 0.05$$

We can thus discard the null hypothesis at significance
level 5% if the observed disagreement is less
than 2.88%. It was in fact 0.7% before error correction,
and virtually zero ($\frac{21}{55{,}000}$) after negotiation.
This means that we can actually discard the
hypotheses that the human evaluators on average
disagree in at least 0.8% of the cases before error
correction, and in at least 0.1% of the cases after
negotiations, at significance level 5%.
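
The arithmetic of this test can be checked numerically. The following is a minimal sketch in Python, assuming the same normal approximation to the relative frequency; the function names `rejection_threshold` and `p_value` are illustrative and do not come from the paper.

```python
from math import sqrt
from statistics import NormalDist


def rejection_threshold(p0, n, alpha=0.05):
    """Observed disagreement rate below which H0: p >= p0 is rejected."""
    z = NormalDist().inv_cdf(alpha)  # about -1.645 for alpha = 0.05
    return p0 + z * sqrt(p0 * (1.0 - p0) / n)


def p_value(observed, p0, n):
    """P(f_n <= observed) in the boundary case p = p0 (normal approximation)."""
    z = (observed - p0) / sqrt(p0 * (1.0 - p0) / n)
    return NormalDist().cdf(z)


n = 55_000  # size of the benchmark corpus
threshold = rejection_threshold(0.03, n)
print(f"reject H0 (p >= 3%) if observed disagreement < {threshold:.2%}")

# The two stronger hypotheses mentioned in the text:
print(p_value(0.007, 0.008, n) < 0.05)   # 0.7% observed vs. H0: p >= 0.8%
print(p_value(21 / n, 0.001, n) < 0.05)  # 21 words vs. H0: p >= 0.1%
```

Under these assumptions the threshold comes out at about 2.88%, and both stronger null hypotheses are rejected at the 5% level.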

2.3 Tag set conversion 

The EngCG morphological analyser's output for- 
mally differs from most tagged corpora; consider the 
following 5-way ambiguous analysis of "walk":

walk
    walk <SV> <SVO> V SUBJUNCTIVE VFIN
    walk <SV> <SVO> V IMP VFIN
    walk <SV> <SVO> V INF
    walk <SV> <SVO> V PRES -SG3 VFIN
    walk N NOM SG



Statistical taggers usually employ single tags to 
indicate analyses (e.g. "NN" for "N NOM SG"). 
Therefore a simple conversion program was made for 
producing the following kind of output, where each 
reading is represented as a single tag: 

walk V-SUBJUNCTIVE V-IMP V-INF 
V-PRES-BASE N-NOM-SG 

The conversion program reduces the multipart 
EngCG tags into a set of 80 word tags and 17 punc- 
tuation tags (see Appendix) that retain the central 
linguistic characteristics of the original EngCG tag 
set. 



A reduced version of the benchmark corpus was 
prepared with this conversion program for the sta- 
tistical tagger's use. Also EngCG's output was con- 
verted into this format to enable direct comparison 
with the statistical tagger. 
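
A conversion of this kind might be sketched as follows. This is an illustration only, not the authors' program: the reduction rules (dropping angle-bracketed lexical features and VFIN, and rendering -SG3 as BASE) are assumptions inferred from the "walk" example, and the names `reduce_reading` and `reduce_cohort` are hypothetical.

```python
import re
from typing import List

# Assumed reduction rules, inferred from the "walk" example only.
DROPPED = {"VFIN"}            # treated as redundant in the reduced tag
RENAMED = {"-SG3": "BASE"}    # non-third-person present becomes BASE


def reduce_reading(reading: str) -> str:
    """Collapse one multipart EngCG reading into a single hyphenated tag."""
    features = reading.split()[1:]  # skip the base form
    kept = []
    for feat in features:
        # Skip lexical features such as <SV> and features listed in DROPPED.
        if re.fullmatch(r"<[^>]+>", feat) or feat in DROPPED:
            continue
        kept.append(RENAMED.get(feat, feat))
    return "-".join(kept)


def reduce_cohort(word: str, readings: List[str]) -> str:
    """Render a word and all its readings on one line, one tag per reading."""
    return " ".join([word] + [reduce_reading(r) for r in readings])


readings = [
    "walk <SV> <SVO> V SUBJUNCTIVE VFIN",
    "walk <SV> <SVO> V IMP VFIN",
    "walk <SV> <SVO> V INF",
    "walk <SV> <SVO> V PRES -SG3 VFIN",
    "walk N NOM SG",
]
print(reduce_cohort("walk", readings))
```

The full tag set in the Appendix contains tags (e.g. for BE, DO and HAVE) that these three rules alone would not produce, so a real converter needs additional rules.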

3 The Statistical Tagger 

The statistical tagger used in the experiments is a
classical trigram-based HMM decoder of the kind
described in e.g. (Church 1988), (DeRose 1988) and
numerous other articles. Following conventional
notation, e.g. (Rabiner 1989, pp. 272-274) and (Krenn
and Samuelsson 1996, pp. 42-46), the tagger recursively
calculates the $\alpha$, $\beta$, $\gamma$ and $\delta$ variables
for each word-string position $t = 1, \ldots, T$ and each
possible state⁵ $s_i$, $i = 1, \ldots, n$:

$$\alpha_t(i) = P(W_{\le t}; S_t = s_i)$$

$$\beta_t(i) = P(W_{>t} \mid S_t = s_i)$$

$$\gamma_t(i) = P(S_t = s_i \mid W) = \frac{P(W; S_t = s_i)}{P(W)} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{i=1}^{n} \alpha_t(i) \cdot \beta_t(i)}$$

$$\delta_t(i) = \max_{S_{\le t-1}} P(S_{\le t-1}, S_t = s_i; W_{\le t})$$

Here

$$W = W_1 = w_{k_1}, \ldots, W_T = w_{k_T}$$

$$W_{\le t} = W_1 = w_{k_1}, \ldots, W_t = w_{k_t}$$

$$W_{>t} = W_{t+1} = w_{k_{t+1}}, \ldots, W_T = w_{k_T}$$

$$S_{\le t} = S_1 = s_{i_1}, \ldots, S_t = s_{i_t}$$

where $S_t = s_i$ is the event of the $t$th word being
emitted from state $s_i$ and $W_t = w_{k_t}$ is the event of
the $t$th word being the particular word $w_{k_t}$ that was
actually observed in the word string.

Note that for $t = 1, \ldots, T-1$; $i, j = 1, \ldots, n$

$$\alpha_{t+1}(j) = \left[\sum_{i=1}^{n} \alpha_t(i) \cdot p_{ij}\right] \cdot a_{j k_{t+1}}$$

$$\beta_t(i) = \sum_{j=1}^{n} \beta_{t+1}(j) \cdot p_{ij} \cdot a_{j k_{t+1}}$$

$$\delta_{t+1}(j) = \left[\max_{i} \delta_t(i) \cdot p_{ij}\right] \cdot a_{j k_{t+1}}$$

where $p_{ij} = P(S_{t+1} = s_j \mid S_t = s_i)$ are the
transition probabilities, encoding the tag N-gram
probabilities, and

$$a_{jk} = P(W_t = w_k \mid S_t = s_j) = P(W_t = w_k \mid X_t = x_j)$$

are the lexical probabilities. Here $X_t$ is the random
variable of assigning a tag to the $t$th word and $x_j$ is
the last tag of the tag sequence encoded as state $s_j$.
Note that $s_i \neq s_j$ need not imply $x_i \neq x_j$.

5 The $(N-1)$th-order HMM corresponding to an N-gram
tagger is encoded as a first-order HMM, where each state
corresponds to a sequence of $N-1$ tags, i.e., for a trigram
tagger, each state corresponds to a tag pair.
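
The recursions for the forward, backward and Viterbi variables can be illustrated on a toy model. The sketch below is in plain Python and is not the authors' implementation: the initial state distribution `pi` is an assumption (the text does not specify how the recursion is started), and the two-state model and its probabilities are invented for illustration.

```python
def forward_backward_viterbi(pi, p, a, obs):
    """pi[i]: assumed initial state probabilities; p[i][j]: transition
    probabilities; a[j][k]: lexical probabilities; obs: word indices."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    delta = [[0.0] * n for _ in range(T)]
    back = [[0] * n for _ in range(T)]
    for i in range(n):
        alpha[0][i] = delta[0][i] = pi[i] * a[i][obs[0]]
    # Forward (alpha) and Viterbi (delta) recursions
    for t in range(T - 1):
        k = obs[t + 1]
        for j in range(n):
            alpha[t + 1][j] = sum(alpha[t][i] * p[i][j] for i in range(n)) * a[j][k]
            best = max(range(n), key=lambda i: delta[t][i] * p[i][j])
            back[t + 1][j] = best
            delta[t + 1][j] = delta[t][best] * p[best][j] * a[j][k]
    # Backward (beta) recursion
    for t in range(T - 2, -1, -1):
        k = obs[t + 1]
        for i in range(n):
            beta[t][i] = sum(beta[t + 1][j] * p[i][j] * a[j][k] for j in range(n))
    # State posteriors (gamma)
    gamma = []
    for t in range(T):
        z = sum(alpha[t][i] * beta[t][i] for i in range(n))
        gamma.append([alpha[t][i] * beta[t][i] / z for i in range(n)])
    # Most probable state sequence, read off the back-pointers
    state = max(range(n), key=lambda i: delta[T - 1][i])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return alpha, beta, gamma, delta, path


# Toy model with two states and two word types (all numbers invented).
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.4, 0.6]]
a = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1]
alpha, beta, gamma, delta, path = forward_backward_viterbi(pi, p, a, obs)
print(path)
```

For long word strings a practical implementation would rescale the variables or work in log space to avoid underflow; this is omitted here for brevity.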

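More precisely, the tagger employs the converse
lexical probabilities

$$a'_{jk} = \frac{P(X_t = x_j \mid W_t = w_k)}{P(X_t = x_j)} = \frac{a_{jk}}{P(W_t = w_k)}$$

This results in slight variants $\alpha'$, $\beta'$, $\gamma'$ and $\delta'$
of the original quantities:

$$\frac{\alpha_t(i)}{\alpha'_t(i)} = \frac{\delta_t(i)}{\delta'_t(i)} = \prod_{u=1}^{t} P(W_u = w_{k_u})$$

$$\frac{\beta_t(i)}{\beta'_t(i)} = \prod_{u=t+1}^{T} P(W_u = w_{k_u})$$

and thus $\forall i, t$

$$\gamma'_t(i) = \frac{\alpha'_t(i) \cdot \beta'_t(i)}{\sum_{i=1}^{n} \alpha'_t(i) \cdot \beta'_t(i)} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{i=1}^{n} \alpha_t(i) \cdot \beta_t(i)} = \gamma_t(i)$$

and $\forall t$

$$\operatorname{argmax}_{1 \le i \le n} \delta'_t(i) = \operatorname{argmax}_{1 \le i \le n} \delta_t(i)$$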

The rationale behind this is to facilitate estimating
the model parameters from sparse data. In more
detail, it is easy to estimate P(tag | word) for a
previously unseen word by backing off to statistics
derived from words that end with the same sequence
of letters (or based on other surface cues), whereas
directly estimating P(word | tag) is more difficult.
This is particularly useful for languages with a rich 
inflectional and derivational morphology, but also 
for English: for example, the suffix "-tion" is a 
strong indicator that the word in question is a noun; 
the suffix "-able" that it is an adjective. 

More technically, the lexicon is organised as a 
reverse-suffix tree, and smoothing the probability es- 
timates is accomplished by blending the distribution 
at the current node of the tree with that of higher- 
level nodes, corresponding to (shorter) suffixes of the 
current word (suffix). The scheme also incorporates 
probability distributions for the set of capitalized 
words, the set of all-caps words and the set of in- 
frequent words, all of which are used to improve the 
estimates for unknown words. Employing a small 
amount of back-off smoothing also for the known 
words is useful to reduce lexical tag omissions. Em- 
pirically, looking two branching points up the tree 
for known words, and all the way up to the root
for unknown words, proved optimal. The method
for blending the distributions applies equally well to 
smoothing the transition probabilities pij, i.e., the 
tag N-gram probabilities, and both the scheme and 
its application to these two tasks are described in de- 
tail in (Samuelsson 1996), where it was also shown 
to compare favourably to (deleted) interpolation, see 
(Jelinek and Mercer 1980), even when the back-off 
weights of the latter were optimal. 
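
The idea of backing off along word suffixes can be sketched as follows. This is not the scheme of (Samuelsson 1996): the fixed blending weight `BLEND`, the cap `MAX_SUFFIX` on suffix length and the toy training data are all assumptions made for illustration, and the special distributions for capitalized, all-caps and infrequent words are left out.

```python
from collections import Counter, defaultdict

MAX_SUFFIX = 4   # assumed cap on suffix length
BLEND = 0.7      # assumed fixed weight given to the more specific node


def train(tagged_words):
    """Count tags at every suffix node, from the empty suffix (root) upward."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for length in range(min(len(word), MAX_SUFFIX) + 1):
            counts[word[len(word) - length:]][tag] += 1
    return counts


def tag_distribution(word, counts):
    """Estimate P(tag | word), blending from the root to the longest seen suffix."""
    root = counts[""]
    total = sum(root.values())
    dist = {tag: c / total for tag, c in root.items()}
    for length in range(1, min(len(word), MAX_SUFFIX) + 1):
        node = counts.get(word[len(word) - length:])
        if not node:
            break  # unseen suffix: keep the backed-off estimate
        node_total = sum(node.values())
        dist = {tag: BLEND * node.get(tag, 0) / node_total + (1 - BLEND) * prob
                for tag, prob in dist.items()}
    return dist


# Toy training data with invented coarse tags.
counts = train([("station", "N"), ("nation", "N"), ("readable", "A"),
                ("walk", "V"), ("walk", "N"), ("quickly", "ADV")])
dist = tag_distribution("notation", counts)
print(max(dist, key=dist.get))
```

With this toy data the unseen word "notation" is judged most likely to be a noun because it shares the suffix "-tion" with two training nouns.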

The $\delta$ variables enable finding the most probable
state sequence under the HMM, from which the most
likely assignment of tags to words can be directly
established. This is the normal modus operandi of an
HMM decoder. Using the $\gamma$ variables, we can calculate
the probability of being in state $s_i$ at string
position $t$, and thus having emitted $w_{k_t}$ from this state,
conditional on the entire word string. By summing
over all states that would assign the same tag to this
word, the individual probability of each tag being
assigned to any particular input word, conditional on
the entire word string, can be calculated:

$$P(X_t = x_i \mid W) = \sum_{s_j : x_j = x_i} P(S_t = s_j \mid W) = \sum_{s_j : x_j = x_i} \gamma_t(j)$$



This allows retaining multiple tags for each word by 
simply discarding only low-probability tags; those 
whose probabilities are below some threshold value. 
Of course, the most probable tag is never discarded, 
even if its probability happens to be less than the 
threshold value. By varying the threshold, we can 
perform a recall-precision, or error-rate-ambiguity, 
tradeoff. A similar strategy is adopted in (de Mar- 
cken 1990). 
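
The thresholding step can be sketched in a few lines of Python. The state posteriors and the mapping from states to their last tags below are invented for illustration, and the function names are hypothetical.

```python
def tag_posteriors(gamma_t, state_tags):
    """Sum the state posteriors at one position over states with the same last tag."""
    post = {}
    for j, prob in enumerate(gamma_t):
        post[state_tags[j]] = post.get(state_tags[j], 0.0) + prob
    return post


def prune_tags(posteriors, threshold):
    """Discard tags below the threshold, but never the most probable tag."""
    best = max(posteriors, key=posteriors.get)
    kept = {tag for tag, prob in posteriors.items() if prob >= threshold}
    kept.add(best)
    return kept


# Four states (tag pairs) at one word position; only the last tag of each is listed.
gamma_t = [0.50, 0.30, 0.15, 0.05]
state_tags = ["N-NOM-SG", "N-NOM-SG", "V-INF", "V-IMP"]
post = tag_posteriors(gamma_t, state_tags)
print(prune_tags(post, 0.10))  # two tags survive
print(prune_tags(post, 0.90))  # only the most probable tag survives
```

Raising the threshold lowers the remaining ambiguity at the cost of a higher error rate, which is the tradeoff explored in the experiments.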

4 Experiments 

The statistical tagger was trained on 357,000 words 
from the Brown corpus (Francis and Kucera 1982), 
reannotated using the EngCG annotation scheme 
(see above). In a first set of experiments, a 35,000 
word subset of this corpus was set aside and used to 
evaluate the tagger's performance when trained on 
successively larger portions of the remaining 322,000 
words. The learning curve, showing the error rate af- 
ter full disambiguation as a function of the amount 
of training data used, see Figure 1, has levelled off at
322,000 words, indicating that little is to be gained 
from further training. We also note that the ab- 
solute value of the error rate is 3.51% — a typi- 
cal state-of-the-art figure. Here, previously unseen 
words contribute 1.08% to the total error rate, while 
the contribution from lexical tag omissions is 0.08%. 
95% confidence intervals for the error rates would 
range from ± 0.30% for 30,000 words to ± 0.20% at 
322,000 words. 

Figure 1: Learning curve for the statistical tagger
on the Brown corpus.

The tagger was then trained on the entire set
of 357,000 words and confronted with the separate
55,000-word benchmark corpus, and run both in full
and partial disambiguation mode. Table 1 shows
the error rate as a function of remaining ambiguity
(tags/word) both for the statistical tagger, and for
the EngCG-2 tagger. The error rate for full
disambiguation using the $\delta$ variables is 4.72% and using
the $\gamma$ variables is 4.68%, both ±0.18% with
confidence degree 95%. Note that the optimal tag
sequence obtained using the $\gamma$ variables need not equal
the optimal tag sequence obtained using the $\delta$
variables. In fact, the former sequence may be assigned
zero probability by the HMM, namely if one of its
state transitions has zero probability.

    Ambiguity      Error rate (%)
    (tags/word)    Statistical tagger ($\gamma$)    EngCG-2
    1.000           4.68
    1.012           4.20
    1.025           3.75
    1.026          (3.72)                      0.43
    1.035          (3.48)                      0.29
    1.038           3.40
    1.048          (3.20)                      0.15
    1.051           3.14
    1.059          (2.99)                      0.12
    1.065           2.87
    1.070          (2.80)                      0.10
    1.078           2.69
    1.093           2.55

Table 1: Error-rate-ambiguity tradeoff for both taggers
on the benchmark corpus. Parenthesized numbers
are interpolated.
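
The quoted ±0.18% is consistent with a standard binomial confidence interval under the normal approximation. The sketch below assumes that this is how the interval was obtained, which the text does not state; `half_width` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def half_width(error_rate, n_words, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for an error rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sqrt(error_rate * (1.0 - error_rate) / n_words)


# Full disambiguation on the 55,000-word benchmark corpus.
for rate in (0.0472, 0.0468):
    print(f"{rate:.2%} +/- {half_width(rate, 55_000):.2%}")
```

Both error rates give a half-width of about 0.18 percentage points.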

Previously unseen words account for 2.01%, and 
lexical tag omissions for 0.15% of the total error rate. 
These two error sources are together exactly 1.00% 
higher on the benchmark corpus than on the Brown 
corpus, and account for almost the entire difference
in error rate. They stem from using less complete
lexical information sources, and are most likely the 
effect of a larger vocabulary overlap between the test 
and training portions of the Brown corpus than be- 
tween the Brown and benchmark corpora. 

The ratio between the error rates of the two tag- 
gers with the same amount of remaining ambiguity 
ranges from 8.6 at 1.026 tags/word to 28.0 at 1.070 
tags/word. The error rate of the statistical tagger 
can be further decreased, at the price of increased 
remaining ambiguity, see Figure 2. In the limit of
retaining all possible tags, the residual error rate is
entirely due to lexical tag omissions, i.e., it is 0.15%,
with on average 14.24 tags per word. The reason
that this figure is so high is that the unknown words, 
which comprise 10% of the corpus, are assigned all 
possible tags as they are backed off all the way to 
the root of the reverse-suffix tree. 
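
The error-rate ratios can be recomputed from the rows of Table 1 where both taggers have a value. The short Python sketch below uses the rounded figures from the table, so small deviations from the ratios quoted in the text are to be expected.

```python
# (ambiguity in tags/word, statistical tagger error %, EngCG-2 error %),
# taken from the rows of Table 1 where both systems have a value.
rows = [
    (1.026, 3.72, 0.43),
    (1.035, 3.48, 0.29),
    (1.048, 3.20, 0.15),
    (1.059, 2.99, 0.12),
    (1.070, 2.80, 0.10),
]

ratios = [stat / engcg for _, stat, engcg in rows]
for (ambiguity, _, _), ratio in zip(rows, ratios):
    print(f"{ambiguity:.3f} tags/word: ratio {ratio:.2f}")
```

The first ratio evaluates to roughly 8.65 and the last to 28.0, in line with the range of 8.6 to 28.0 reported in the text.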



[Plot: Error-rate-ambiguity trade-off; x-axis: Remaining ambiguity (Tags/Word)]

Figure 2: Error-rate-ambiguity tradeoff for the statistical
tagger on the benchmark corpus.



5 Discussion 

Recently voiced scepticisms concerning the superior 
EngCG tagging results boil down to the following: 

• The reported results are due to the simplicity 
of the tag set employed by the EngCG system. 

• The reported results are an effect of trading 
high ambiguity resolution for lower error rate. 

• The results are an effect of so-called priming 
of the human annotators when preparing the 
test corpora, compromising the integrity of the 
experimental evaluations. 

In the current article, these points of criticism 
were investigated. A state-of-the-art statistical 
tagger, capable of performing error-rate-ambiguity 
tradeoff, was trained on a 357,000-word portion of 
the Brown corpus reannotated with the EngCG tag 
set, and both taggers were evaluated using a separate
55,000-word benchmark corpus new to both systems.
This benchmark corpus was independently
disambiguated by two linguists, without access to 
the results of the automatic taggers. The initial 
differences between the linguists' outputs (0.7% of 
all words) were jointly examined by the linguists; 
practically all of them turned out to be clerical er- 
rors (rather than the product of genuine difference 
of opinion). 

In the experiments, the performance of the 
EngCG-2 tagger was radically better than that of 
the statistical tagger: at ambiguity levels common 
to both systems, the error rate of the statistical tag- 
ger was 8.6 to 28 times higher than that of EngCG- 
2. We conclude that neither the tag set used by 
EngCG-2, nor the error-rate-ambiguity tradeoff, nor 
any priming effects can possibly explain the observed 
difference in performance. 

Instead we must conclude that the lexical and con- 
textual information sources at the disposal of the 
EngCG system are superior. Investigating this em- 
pirically by granting the statistical tagger access to 
the same information sources as those available in 
the Constraint Grammar framework constitutes fu- 
ture work. 
Appendix: Reduced EngCG tag set



Punctuation tags: 

@colon

@comma

@dash

@dotdot

@dquote

@exclamation

@fullstop

@lparen

@rparen

@rparen

@rparen

@rparen

@lquote

@rquote

@slash

@newlines

@question

@semicolon

Word tags: 

A-ABS 

A-CMP 

A-SUP 

ABBR-GEN-SG/PL 

ABBR-GEN-PL 

ABBR-GEN-SG 

ABBR-NOM-SG/PL 

ABBR-NOM-PL 

ABBR-NOM-SG 

ADV-ABS 

ADV-CMP 

ADV-SUP 

ADV-WH 

BE-EN 



BE-IMP 
BE-INF
BE-ING 

BE-PAST-BASE 

BE-PAST-WAS 

BE-PRES-AM 

BE-PRES-ARE 

BE-PRES-IS 

BE-SUBJUNCTIVE 

CC 

CCX 

CS 

DET-SG/PL 

DET-SG 

DET-WH 

DO-EN 

DO-IMP 

DO-INF 

DO-ING 

DO-PAST 

DO-PRES-BASE 

DO-PRES-SG3 

DO-SUBJUNCTIVE 

EN 

HAVE-EN 

HAVE-IMP 

HAVE-INF 

HAVE-ING 

HAVE-PAST 

HAVE-PRES-BASE 

HAVE-PRES-SG3 

HAVE-SUBJUNCTIVE 

I 

INFMARK 



ING 

N-GEN-SG/PL 

N-GEN-PL 

N-GEN-SG 

N-NOM-SG/PL 

N-NOM-PL 

N-NOM-SG 

NEG 

NUM-CARD 

NUM-FRA-PL 

NUM-FRA-SG 

NUM-ORD 

PREP 

PRON 

PRON-ACC 

PRON-CMP 

PRON-DEM-PL 

PRON-DEM-SG 

PRON-GEN 

PRON-INTERR 

PRON-NOM-SG/PL 

PRON-NOM-PL 

PRON-NOM-SG 

PRON-REL 

PRON-SUP 

PRON-WH 

V-AUXMOD 

V-IMP 

V-INF 

V-PAST 

V-PRES-BASE 

V-PRES-SG1 

V-PRES-SG2 

V-PRES-SG3 

V-SUBJUNCTIVE 



